Abstract. Several methods have recently been proposed for verifying
processors with out-of-order execution. These methods use intermediate
abstractions to decompose the verification process into smaller steps.
Unfortunately, the process of manually creating intermediate abstractions
is very laborious. We present an approach that dramatically reduces the
need for an intermediate abstraction, so that only the scheduling logic
of the implementation is abstracted. After the abstraction, we apply an
enhanced incremental-flushing approach to verify the remaining circuitry
by comparing the processor description against itself in a slightly simpler
configuration. By induction, we demonstrate that any reachable
configuration is equivalent to the simplest possible configuration. Finally, we
prove correctness on the simplest configuration. The approach is
illustrated with a simple example of an out-of-order execution core.


1  Introduction 

Several techniques for formally verifying out-of-order microprocessor designs
using theorem proving have recently been suggested [4, 10-12]. These techniques all
use some form of intermediate abstraction to bridge the gap in abstraction level
between the implementation and the specification, as defined by an
instruction-set architecture (ISA).

Creating such intermediate abstractions manually and then showing the
correspondence between the implementation and the intermediate abstraction is
laborious, even for high-level models. Omitting the intermediate abstraction and
manually developing the abstraction relation between the implementation and
the ISA is even harder. First, the extended instruction parallelism in out-of-order
architectures results in many complex interactions between executing
instructions. This greater complexity makes it very difficult to devise an abstraction
function. Second, large (> 40 element) buffers are used to record and maintain
the program order of instructions.

Burch and Dill have devised an approach for pipelined microarchitectures
that automatically generates the abstraction function by flushing the
implementation state [3]. The technique has been extended to dual-issue and super-scalar
architectures [2, 8, 13]. However, these techniques do not work for out-of-order
architectures in practice because the number of cycles required to empty the buffer
completely is so large. The logical formulas are too complex to manipulate in
proofs and often too complex even to construct.

We have previously proposed incremental flushing, an extension to the Burch
and Dill flushing approach that inductively empties the buffer in smaller proof
steps [12]. We have applied it to automate part of the verification process of
out-of-order designs. The approach requires, however, that the out-of-order core
is abstracted into an in-order version. In this paper, we extend the
incremental-flushing approach to directly reason about the out-of-order core also. This avoids
the need for the in-order abstraction of our earlier approach. The implementation
abstraction that is still required is comparatively minimal, and the automated
incremental flushing approach can cover a much larger portion of the original
design. This automates the generation of the abstraction function and significantly
reduces the manual effort required.

The extended technique only requires that the internal scheduling logic of
the processor be manually abstracted. An instruction is processed through a
number of internal steps, each of which may take several cycles. The scheduling
logic affected determines which buffer entries, datapath resources, and busses
different instructions and steps are assigned to. We apply induction to show that
the implementation executing any number of instructions (up to the maximum
allowed) is functionally equivalent to the same implementation executing only
one instruction at a time. We finally complete the verification by checking the
implementation with one instruction against the ISA. This proof is much simpler,
since the bypass and buffering logic can be simplified away in the proofs. Note
that to make the induction work, it must be possible to stall each stage of the
out-of-order pipeline independently.

We use the same simple model of an out-of-order execution core to
illustrate our approach that we used previously [12]. Although this example is not
representative of industrial-scale designs, it captures essential features of
out-of-order architectures: large queuing buffers, resource allocation within the buffers,
and data-path scheduling of execution resources. We have discharged the proof
obligations for the simple example using the Stanford Validity Checker (SVC).


2  Related  Work 

Sawada and Hunt's theorem-proving approach uses a table of history variables,
called a micro-architectural execution trace table (MAETT) [10, 11]. The MAETT
is an intermediate abstraction that contains selected parts of the
implementation as well as extra history variables and variables holding abstracted values.
It includes the ISA state and the ISA transition function. A predicate relating
the implementation and MAETT is found by manual inspection and proven by
induction to be an invariant on the execution of the implementation. In our
approach, we do not need an intermediate abstraction of the circuit; only the
scheduling logic is abstracted. We then use an incremental flushing technique
to automatically generate the abstraction function, reducing the manual work
required to relate the intermediate abstraction to the ISA.

Damm and Pnueli generalize an ISA specification to a non-deterministic
abstraction [4]. They verify that the implementation satisfies the abstraction by
manually establishing and proving the appropriate invariants. They have applied
their technique to the Tomasulo algorithm [5], which has out-of-order instruction
completion. In contrast, our out-of-order model features in-order retirement and
the corresponding large buffers that are required. Damm and Pnueli's
abstraction non-deterministically represents all possible instruction sequences which
observe dataflow dependencies. Our non-deterministic scheduler abstraction also
observes dataflow dependencies, but is additionally constrained by allowable
resource allocations (e.g., buffer entries) in the implementation. Applying their
method to architectures with in-order retirement would require manual proof by
induction that the intermediate abstraction satisfies the ISA. We automate the
proof obligations with incremental flushing.

Hosabettu  et  al.  use  a  technique  for  decomposing  the  abstraction  function  and 
have  applied  it  to  the  example  of  Sawada  and  Hunt  with  out-of-order  retirement 
[7].  Although  this  aids  in  finding  an  appropriate  abstraction  function,  manual 
intervention  is  needed  in  its  construction. 

Henzinger et al. use Tomasulo's algorithm to illustrate a method for
manually decomposing the proof of correctness [6]. They manually provide abstract
modules for parts of the implementation. These modules correspond to
implementation internal steps. Similar to our approach, the abstractions are invariants
on the implementation and are extended with auxiliary variables. Again, our new
approach automates much of the abstraction process.

McMillan  model  checks  the  Tomasulo  algorithm  by  manually  decomposing 
the  proof  into  smaller  correctness  proofs  of  the  internal  steps  [9].  He  also  uses  a 
reduction  technique  based  on  symmetry  to  extend  the  proof  to  a  large  number 
of  execution  units.  Berezin  et  al.  abstract  the  data  path  by  introducing  a  data 
structure called a reference table. Each entry in the reference table corresponds
to  an  uninterpreted  term  representing  computation  results  of  instructions  [1]. 
They  have  applied  their  technique  to  Tomasulo’s  algorithm.  However,  the  size  of 
the  state  space  grows  exponentially  with  the  number  of  concurrent  instructions. 
Designs with in-order retirement contain a large reorder buffer and can
contain many instructions executing simultaneously. In contrast to both automated
model-checking  approaches,  our  theorem-proving  based  method  generalizes  to 
arbitrary  buffer  sizes. 

3  Preliminaries 

The desired behavior of a processor is defined by an instruction-set architecture
(ISA). The ISA represents the programmer-level view of a machine that executes
instructions sequentially. The ISA for our example is shown in Figure 1a. The ISA
state consists of a register file (RF), while the next-state function is computed
with a generic execution unit (EU) that can execute any instruction. The ISA
also accepts a bubble input that leaves the state unchanged. Note that our ISA
model does not include a program counter or memory state, as these are also
omitted from our simplified out-of-order model.

Fig. 1. (a) The simple ISA model. (b) Instruction flow in our out-of-order execution
core IMPL.

Modern processors implement the ISA more aggressively. In out-of-order
architectures, instructions are fetched, decoded, and sent to the execution core in
program order. Internally, however, the core executes instructions out-of-order,
as allowed by data dependencies. This allows independent instructions to
execute concurrently. Finally, instruction results are written back to
architecturally-visible state (the register file) in the order they were issued.

Consider our example out-of-order execution core (IMPL) shown in
Figure 1b. The architectural register file (RF) contains the current state of the
ISA-defined architectural registers. An instruction is processed in a number of
steps, which may each last a number of cycles. When an instruction is issued, new
entries are allocated in both the dispatch and retirement buffers, and the
register translation table (RTT) entry for the logical register corresponding to the
instruction destination is updated. The RTT is used to locate the instruction's
source data. Instructions are dispatched, possibly out-of-order, from the dispatch
buffer (DB) to individual execution units when their operands are ready and an
execution unit is available. When an instruction finishes execution, the result is
written back to the retirement buffer (RB). This data is also bypassed into the
DB for instructions awaiting that particular result. Finally, the RB logic must
ensure that instruction results are retired (committed to architectural state) in
the original program order. When an RB entry is retired, the RTT is informed
so that the logical register entry corresponding to the instruction's destination
can be updated if necessary. IMPL also accepts a special bubble flushing input
in place of an instruction. Intuitively, a bubble is similar to a NOP instruction but
does not affect any state or consume any resources after being issued.

Figure 1b also shows the scheduling logic, which handles the allocation of
hardware resources and instruction flow. Scheduling must determine (1) which
slot in the DB to allocate at issue, (2) when to dispatch a ready instruction and
which EU to dispatch it to, (3) when an EU writes back a completed execution
result, and (4) when to retire a completed instruction. We call this collection of
resource allocation and dataflow decisions from the scheduling logic the choice
for a given cycle.

There are obviously many sound scheduling algorithms, and many allowable
scheduling choices exist for a given configuration. Which choices are allowable is
determined by the state of other instructions and available hardware resources.
For example, a sound but inefficient scheduling algorithm would only allow one
instruction to execute at a time, greatly simplifying the interaction between
instructions. An optimal scheduling algorithm would execute instructions in
whatever dataflow order makes the best use of execution resources. An implementable
scheduling algorithm falls somewhere in the middle and must balance execution
performance against implementation considerations.

We  have  made  significant  simplifying  assumptions  in  our  processor  model: 
instructions  have  only  one  source  operand,  and  only  one  issue  and  one  retire 
can  occur  each  cycle.  We  also  omit  a  “front-end”  with  fetch,  decode,  and  branch 
prediction  logic.  Omitting  these  features  allowed  our  efforts  to  focus  on  the 
features  which  make  the  out-of-order  verification  problem  difficult:  the  out-of- 
order  execution  and  the  large  effective  depth  of  the  pipeline.  The  verification 
discussed  in  this  paper  uses  a  model  with  unbounded  buffers. 


4  The  Approach 

As in [12], the goal of our approach is to prove that the out-of-order
implementation IMPL (as described by an HDL model) satisfies the ISA model. We define
$\delta_I$ to be the implementation next-state function, which takes an initial state $q_I$
and an input instruction $i$ and returns a new state $q_I'$, e.g., $q_I' = \delta_I(q_I, i)$. We
extend $\delta_I$ in the obvious way to operate over input sequences $w = i_0 \ldots i_n$. We
define $\delta_S$ similarly for ISA.

Let $\sigma$ be a size function that returns the number of currently executing
instructions, i.e., those that have been issued but not retired. We require that
$\sigma(q_I^0) = 0$ for an initial implementation state $q_I^0$. We define an instruction
sequence $w$ to be completed iff $\sigma(\delta_I(q_I^0, w)) = 0$, i.e., all instructions have been
retired after executing $w$. We use the projection function $\pi_{RF}(q_I)$ to denote the
register file contents in state $q_I$, which we define as the specification state. For
clarity in presentation, we define $q_{I1} = q_{I2}$ to be $\pi_{RF}(q_{I1}) = \pi_{RF}(q_{I2})$, and we
will sometimes use $=$ when the projection $\pi_{RF}$ is redundant on one side of the
equality.

The  overall  correctness  property  for  IMPL  with  respect  to  ISA  is  expressed 
formally  as: 

Correctness For every completed instruction sequence $w$ and initial state $q_I^0$,

$\delta_I(q_I^0, w) = \delta_S(\pi_{RF}(q_I^0), w)$.


That  is,  the  architecturally  visible  state  in  IMPL  and  ISA  is  identical  after 
executing  any  instruction  sequence  that  retires  all  outstanding  instructions  in 
the  implementation. 

Our approach has three steps. First, we locate and abstract the IMPL
scheduling logic and prove the abstraction correct. We refer to the abstracted
implementation as SAI (scheduler-abstracted implementation). In the second step, we use
incremental flushing to show that SAI with an abstracted scheduler calculates
the same results as if the instructions were executed one at a time. Note that
while the functional results should be identical, the timing of the results will of
course be different. This proves the correctness of the reordering control logic.
Finally, we show that SAI with an abstracted scheduler executing one instruction
at a time satisfies the ISA.


5  First  Step:  Abstracting  the  Scheduling  Logic 


We  first  identify  the  scheduling  logic  in  the  design  and  its  interface  to  the  rest 
of  the  circuit.  We  wish  to  replace  the  original  scheduling  logic  with  the  most 
general  scheduling  algorithm  that  still  provides  legal  choices  to  the  rest  of  the 
circuit.  For  example,  the  abstracted  scheduling  logic  for  our  simple  example  will 
(1)  issue  an  instruction  to  any  empty  slot  in  the  DB,  (2)  dispatch  an  instruction 
to  any  available  execution  unit,  (3)  write  back  results  from  any  execution  unit 
that  has  finished  executing,  and  (4)  retire  any  instruction  with  result  data.  In  a 
given  state,  the  abstracted  scheduling  logic  in  SAI  non-deterministically  chooses 
an  allocation  based  on  the  current  state  of  the  SAI.  The  non-determinism  is 
implemented  as  an  extra,  unconstrained  input. 
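
The non-deterministic choice can be pictured with a minimal Python sketch. Everything in it is an illustrative assumption rather than part of the verified model: the `CoreState` fields, the tuple encoding of a choice, the restriction of retirement to the oldest RB entry, and the use of an integer `nd` as the unconstrained input.

```python
from dataclasses import dataclass
from itertools import product
from typing import FrozenSet, List, Optional, Tuple


@dataclass(frozen=True)
class CoreState:
    """Hypothetical summary of the state the abstract scheduler observes."""
    free_db_slots: FrozenSet[int]    # empty dispatch-buffer slots
    ready_db_slots: FrozenSet[int]   # DB entries whose operand is ready
    free_eus: FrozenSet[int]         # idle execution units
    finished_eus: FrozenSet[int]     # EUs holding a completed result
    head_has_result: bool            # oldest RB entry has result data


# (issue slot, (dispatch slot, EU), writeback EU, retire flag); None means "stall".
Choice = Tuple[Optional[int], Optional[Tuple[int, int]], Optional[int], bool]


def legal_choices(s: CoreState) -> List[Choice]:
    """Enumerate every choice that only allocates free resources."""
    issue = [None] + sorted(s.free_db_slots)
    dispatch = [None] + [(d, e) for d in sorted(s.ready_db_slots)
                         for e in sorted(s.free_eus)]
    writeback = [None] + sorted(s.finished_eus)
    retire = [False, True] if s.head_has_result else [False]
    return list(product(issue, dispatch, writeback, retire))


def abstract_scheduler(s: CoreState, nd: int) -> Choice:
    """Use the unconstrained input nd to select one of the legal choices."""
    options = legal_choices(s)
    return options[nd % len(options)]
```

In this sketch the all-stall choice is always among the options, which mirrors the requirement that every stage can be stalled independently.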



Fig.  2.  Instruction  flow  in  SAI  with  the  abstracted  resource  allocator. 


The SAI with an abstracted scheduler is illustrated in Figure 2. The
abstract scheduler monitors the state of SAI and provides SAI with a scheduling
choice for every instruction input. Naturally, we want the abstracted scheduler
to make legal choices that only allocate free resources and advance only ready
instructions from one stage to the next. For example, only instructions that have
completed executing may be written back and retired. Identifying and
abstracting the scheduling logic in a realistic design requires a detailed understanding
of the circuit and may be error-prone. Fortunately, soundness of our approach
is not compromised by a bad selection of abstracted scheduler. The later proof
steps will fail if the abstracted scheduler in SAI is either incorrect or too general
to verify its behavior against ISA. Note, however, that we do not require the
scheduler to be centralized. The technique is equally applicable to a distributed
scheduler, where each part of the scheduler is appropriately abstracted.

We first show that the abstract scheduler is sufficiently general to capture
all the possible choice outputs that the implementation scheduler makes. We
then extend this result with a composition argument to show that SAI with
the abstracted scheduler is an appropriate abstraction of IMPL. Let $S_I$ be the
transition function of the implementation scheduler and let $S_A$ be the transition
function of the abstract scheduler. $S_A$ takes an extra, non-deterministic input
$i_{nd}$. We must show that for each step that $S_I$ makes, there exists an $S_A$ step such
that the choice outputs are identical:

Proof Obligation 1 (Scheduler Abstraction Correctness) For every reachable
state $q_I$ of IMPL and for every input $i$, there exists an input $i_{nd}$ such that

$out(S_I(q_I, i)) = out(S_A(q_A, i, i_{nd}))$.

One way of instantiating the abstract scheduler for this proof is to use an
oracle which observes the original scheduler's behavior and knows how the
non-deterministic input affects the abstract scheduler.
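
For a finite toy scheduler, the existential quantifier in Proof Obligation 1 can be checked by plain enumeration. The sketch below is hypothetical throughout: it models only the issue-slot part of a choice, assumes a concrete policy that picks the lowest-numbered free slot, and enumerates all states rather than only reachable ones, which is stronger than the obligation requires.

```python
from itertools import product


def concrete_scheduler(free_slots, want_issue):
    """Assumed deterministic policy: issue into the lowest-numbered free slot."""
    if want_issue and free_slots:
        return min(free_slots)
    return None  # stall


def abstract_scheduler(free_slots, want_issue, nd):
    """Abstract policy: stall, or issue into any free slot, selected by nd."""
    options = [None] + (sorted(free_slots) if want_issue else [])
    return options[nd % len(options)]


def obligation_1_holds(num_slots):
    """Check that every concrete choice is matched by some value of nd."""
    for bits in product([False, True], repeat=num_slots):
        free = frozenset(i for i, b in enumerate(bits) if b)
        for want_issue in (False, True):
            target = concrete_scheduler(free, want_issue)
            # There are at most num_slots + 1 options, so this range covers them all.
            if not any(abstract_scheduler(free, want_issue, nd) == target
                       for nd in range(num_slots + 1)):
                return False
    return True
```

The value of `nd` that makes the inner test succeed plays the role of the oracle: it is computed from the concrete scheduler's output.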

Next, we must establish that SAI with the abstracted scheduler is an
appropriate abstraction of IMPL. We define $\delta_A$ to be the SAI next-state function,
which takes an initial state $q_A$ and a pair consisting of an input instruction $i$
and scheduler choice $ch$ and returns a new state $q_A'$, e.g., $q_A' = \delta_A(q_A, \langle i, ch \rangle)$. We
extend the definition of $\delta_A$ to sequences of instruction inputs $w$ and choice
sequence $w_{ch} = ch_0 \ldots ch_n$ such that $q_A' = \delta_A(q_A, \langle w, w_{ch} \rangle)$ (see footnote 1). We say that a choice
sequence $w_{ch}$ is $\delta_A(q_A, w)$-generated if it is obtained by stimulating the abstract
scheduler to provide a sequence of choices corresponding to the instruction
sequence $w$ from the state $q_A$. We define states $q_I$ of IMPL and $q_A$ of SAI to be
consistent when $q_I = q_A$, i.e., they have identical architecturally visible states.
Using Proof Obligation 1 and a composition argument, we can prove that:

IMPL-SAI Refinement For every instruction sequence $w$ and every pair of
consistent initial states $q_I^0$, $q_A^0$, there exists a $\delta_A(q_A^0, w)$-generated choice sequence
$w_{ch}$ such that

$\delta_I(q_I^0, w) = \delta_A(q_A^0, \langle w, w_{ch} \rangle)$.

Footnote 1: The pair of sequences $\langle w, w_{ch} \rangle$ is easily derived from the corresponding sequence of
pairs $\langle i_0, ch_0 \rangle, \ldots, \langle i_n, ch_n \rangle$.

We prove this by providing the following witness. By induction, we extend Proof
Obligation 1 to work on sequences of inputs and obtain a $\delta_A(q_A^0, w)$-generated
sequence $w_{ch}$ that is equal to the sequence that is output from the
implementation scheduler. Since SAI was obtained from IMPL by abstracting only the
resource allocation logic, the property follows trivially.

Note that this proof requires reachability invariants for IMPL and SAI.
Finding the reachability invariant for IMPL is necessary for any inductive method,
and is not unique to our approach. Finding the reachability invariant for SAI is
straightforward, because of the minimal changes from IMPL.

6  Second  Step:  Functional  Equivalence  of  SAI  and  ISA 

The second step in the verification is to prove that SAI with the abstract
scheduler satisfies ISA. Formally:

SAI-ISA Equivalence For every completed instruction sequence $w$, initial SAI
state $q_A^0$, and $\delta_A(q_A^0, w)$-generated sequence of choices $w_{ch}$:

$\delta_A(q_A^0, \langle w, w_{ch} \rangle) = \delta_S(\pi_{RF}(q_A^0), w)$.


Recall  that  the  Burch-Dill  abstraction  function  flushes  an  implementation 
(by  inserting  bubbles)  for  the  number  of  clock  cycles  necessary  to  completely 
expose  the  internal  state.  In  the  case  of  a  simple  five-stage  pipeline,  only  five 
steps  are  required  to  complete  the  partially  executed  instructions.  Following 
this  approach  with  our  model  would  compare  a  potentially  full  RB  with  the 
ISA  model.  The  Burch-Dill  flushing  technique  would  unroll  SAI  to  the  depth  of 
the  RB,  resulting  in  a  logical  expression  too  large  for  the  decision  procedure  to 
check. 
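
The flushing idea can be illustrated on a deliberately tiny model. The following Python sketch is an assumption-laden toy, not the IMPL or SAI of this paper: instructions are hypothetical `(dst, src, imm)` triples, the buffered core simply applies the oldest pending instruction at retirement (so it does not execute out of order), and `commutes` checks the Burch-Dill commuting diagram for one state and one input.

```python
BUBBLE = None


def isa_step(rf, instr):
    """Toy ISA next-state function: rf[dst] := rf[src] + imm."""
    if instr is BUBBLE:
        return rf
    dst, src, imm = instr
    new = list(rf)
    new[dst] = rf[src] + imm
    return tuple(new)


def impl_step(state, instr):
    """Toy buffered core: retire the oldest pending entry, then issue."""
    rf, pending = state
    if pending:
        rf = isa_step(rf, pending[0])
        pending = pending[1:]
    if instr is not BUBBLE:
        pending = pending + (instr,)
    return rf, pending


def flush(state):
    """Feed bubbles until the buffer is empty; return (register file, cycles)."""
    cycles = 0
    while state[1]:
        state = impl_step(state, BUBBLE)
        cycles += 1
    return state[0], cycles


def commutes(state, instr):
    """Flushing after an implementation step equals an ISA step after flushing."""
    lhs, _ = flush(impl_step(state, instr))
    rhs = isa_step(flush(state)[0], instr)
    return lhs == rhs
```

In this toy, the number of flush cycles equals the number of buffered instructions, so a state with 40 pending entries needs 40 unrolled steps. This is the growth that makes monolithic flushing impractical for a large RB.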

We  extend  the  incremental-flushing  approach  presented  in  [12]  to  overcome 
this  problem.  Rather  than  flushing  the  entire  pipeline  directly,  a  set  of  smaller, 
inductive  flushing  steps  is  performed.  Taken  together,  these  proof  obligations 
imply  the  full,  monolithic  flushing  operation.  To  illustrate  the  approach,  consider 
the  graphical  presentation  of  two  different  executions  (state  sequences)  of  SAI  in 
Figure  3.  We  define  the  execution  of  a  system  as  the  sequence  of  states  that  the 
system  passes  through  when  executing  a  given  input  sequence.  For  instance,  the 
execution  indicated  in  Figure  3a  is  a  result  of  executing  the  instruction  sequence: 

i_1, i_2, bubble, bubble, i_3, bubble, i_4, i_5, bubble, bubble, , bubble, bubble.

with some choice sequence that appropriately allocates the resources so that all
instructions have retired in the final state. Apart from self-loops
indicating internal execution, edges are only traversed when instructions are issued or
retired.


Fig. 3. (a) A Max-n execution $\epsilon_n$. Labels $i_n$ and $r_n$ denote the issue and retirement of
instruction number $n$. The label $r_n \| i_n$ denotes simultaneous issue and retire. r : n is a
shorthand for n cycles where in each cycle, bubbles are issued and nothing is retired.
(b) An equivalent Max-1 execution $\epsilon_1$. The squares indicate the distance between $\epsilon_n$
and $\epsilon_1$. Edge legend: issue, no retire ($\sigma' = \sigma + 1$); no issue, retire ($\sigma' = \sigma - 1$);
issue, retire ($\sigma' = \sigma$); no issue, no retire ($\sigma' = \sigma$).

We use $\epsilon(q_A, \langle w, w_{ch} \rangle)$ to denote the execution resulting from the
application of $\delta_A$ to a state $q_A$ and the input sequence pair $\langle w, w_{ch} \rangle$. We define
$last(\epsilon(q_A, \langle w, w_{ch} \rangle))$ as the last state of the execution. Note that, by definition,
$last(\epsilon(q_A, \langle w, w_{ch} \rangle)) = \delta_A(q_A, \langle w, w_{ch} \rangle)$. Each state in an execution is associated
with the number of active instructions, defined earlier as the size function $\sigma$.
This is illustrated in Figure 3b. We call an execution which contains states of
size at most $n$ a Max-n execution (denoted $\epsilon_n$). Accordingly, completely serialized
executions with at most one outstanding element are Max-1 executions (denoted
$\epsilon_1$). An example of a Max-1 execution corresponding to the execution above could
be

i_1, bubble^4, i_2, bubble^4, i_3, bubble^4, i_4, bubble^4, i_5, bubble^4, bubble^4.

where bubble^4 = bubble, bubble, bubble, bubble. The execution is illustrated in
Figure 3b.
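
As a rough illustration of serialized executions, the following Python sketch builds a stretched instruction sequence and measures the largest size reached by an execution. It is only a sketch under stated assumptions: the helper names are hypothetical, original bubbles are dropped, each instruction is followed by a fixed number `gap` of bubbles that is assumed to be enough for it to retire, and the matching stretch of the choice sequence is not modelled.

```python
BUBBLE = "bubble"


def stretch_to_max1(w, gap):
    """Follow every non-bubble instruction of w with `gap` bubbles."""
    out = []
    for instr in w:
        if instr != BUBBLE:
            out.append(instr)
            out.extend([BUBBLE] * gap)
    return out


def max_size(events):
    """Largest sigma reached, given per-cycle (issued, retired) flags."""
    size = peak = 0
    for issued, retired in events:
        size += int(issued) - int(retired)
        peak = max(peak, size)
    return peak
```

An execution is Max-n exactly when `max_size` of its issue and retire events is at most n. A sequence produced by `stretch_to_max1` yields a Max-1 execution whenever `gap` bubbles really suffice for retirement.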

The  first  step  of  the  SAI-ISA  verification  establishes  that: 

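Incremental-Flushing Induction Step For every initial state $q_A^0$, and for
every Max-n execution $\epsilon_n(q_A^0, \langle w, w_{ch} \rangle)$, there exists $\langle w^1, w^1_{ch} \rangle$ derived from input
pair $\langle w, w_{ch} \rangle$ and a corresponding Max-1 execution $\epsilon_1(q_A^0, \langle w^1, w^1_{ch} \rangle)$ such
that:

$last(\epsilon_n(q_A^0, \langle w, w_{ch} \rangle)) = last(\epsilon_1(q_A^0, \langle w^1, w^1_{ch} \rangle))$.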

A Max-1 execution is derived from a Max-n execution by "stretching" the $w$
and $w_{ch}$ sequences with the appropriate bubbles and stalling choices,
respectively, to stall the relevant parts of the out-of-order core. The intuition behind
this approach is that the final results of Max-n and Max-1 executions should
be identical, because bubbles and stalling choices should not affect functional
behavior. Clearly, if enough bubbles are inserted between subsequent
instructions only one instruction will be in the pipeline at a time. In this situation it is
computationally manageable to compare SAI with ISA, since the bypass control
logic can be discarded in the proof. Section 6.1 details the proof obligations for
this step and describes how we proved this property on our example.

The second SAI-ISA verification step shows that all Max-1 executions
produce the same result as the ISA model:

Incremental-Flushing ISA Step For every initial state $q_A^0$, and every Max-1
execution $\epsilon_1$ corresponding to an instruction sequence $w^1$ and every
$\delta_A(q_A^0, w^1)$-generated choice sequence $w^1_{ch}$:

$last(\epsilon_1(q_A^0, \langle w^1, w^1_{ch} \rangle)) = \delta_S(\pi_{RF}(q_A^0), w^1)$.

Proving  this  is  much  simpler  than  the  original  problem  of  directly  proving  SAI- 
ISA  equivalence,  since  only  one  instruction  is  in  the  machine  at  any  given  time 
(because  of  the  stretching  bubbles  and  stalling  choices).  The  proof  is  carried  out 
by  induction  on  the  length  of  instruction  sequences,  as  described  in  Section  6.2. 


6.1  Inductive  Step 

The incremental flushing proof step can be split up into three proof obligations.
First, we identify the maximum number of cycles required to symbolically
simulate the implementation in order to ensure that at least one instruction is retired.
This is used to prove termination of the induction proof. Let $\delta_A^n$ denote $n$ cycles
of symbolic execution. Formally, we must prove that:

Proof Obligation 2 (Retirement Upper-Bound) There exists an upper bound
$u$, such that for every reachable state $q_A$ such that $\sigma(q_A) \geq 1$ and input sequence
pair $\langle w, w_{ch} \rangle$, at least one active instruction from $q_A$ will be retired between $q_A$
and $\delta_A^u(q_A, \langle w, w_{ch} \rangle)$.

That  is,  we  make  a  progress  assumption  that  the  implementation  retires  an 
instruction  within  u  cycles.  We  derive  u  by  a  worst-case  analysis  and  determine 
the  longest  path  that  an  issued  instruction  could  potentially  follow  before  being 
retired. 

The  upper  bound  u  is  assumed  in  the  main  induction.  As  we  shall  see,  the 
induction  case  is  used  to  inductively  move  the  last  issued  instruction  to  the  end  of 
the  execution  sequence.  In  each  application,  independently  executing  instruction 
steps are reordered. This reordering is performed by moving the instruction until
after the steps of the previously issued instructions.

In each application of the induction case, a subsequence is selected out of the
execution such that an instruction $i$ is issued in the first cycle of the subsequence.
We denote the length of the subsequence by $v$ and will choose it to be $\geq u$. The
length of the subsequence is doubled in the application of the induction case:
the $v$ choices are split up in a way that the first $v$ steps allow SAI to perform
all steps that are not dependent on $i$. The steps related to $i$ are then replayed in
the remaining cycles. As a consequence, the freshly-issued instruction $i$ and its
steps are delayed by $v$ cycles.


Fig. 4. (a) A choice sequence $w_{ch}$. (b) A stretched version $\hat{w}_{ch}$ of the original choice
sequence $w_{ch}$.


To illustrate, consider the scheduling sequence $w_{ch}$ shown in Figure 4a. Each
vertical box corresponds to a choice and the labels $i_n$, $d_n$, $w_n$, and $r_n$
respectively denote which dispatch buffer entry to store an issued instruction in, which
dispatch  buffer  entry  to  dispatch,  which  completed  instruction  to  write  back,  and 
whether  or  not  to  allow  retirement  of  an  instruction  ready  for  retirement.  Each 
number  identifies  a  particular  instruction  n.  For  instance,  the  first  choice  retires 
instruction  1,  writes  back  instruction  2,  and  issues  instruction  4.  A  choice  field 
which  keeps  a  particular  resource  allocation  unchanged  is  denoted  with  “ — ” . 

A scheduling sequence $\hat{w}_{ch}$ is constructed by adding bubbles and stalling
choices to $w_{ch}$ (Figure 4b). Observe that the ordering of the issue, dispatch,
writeback, and retirement choices for a given instruction is maintained. The
only difference is the delayed issue of instruction 4 and its subsequent dispatch
and writeback. On a per-instruction basis, the resources in $w_{ch}$ and $\hat{w}_{ch}$ must be
the same and occur in the same order. This crucial requirement guarantees that
the resulting partially-executed state is the same in both cases and facilitates an
inductive proof over SAI state.

In the induction case, the length of the subsequence, $v$, must be chosen so
that it is at least $u$ cycles and long enough to make sure that the instruction
can properly be moved past the steps of other instructions. In our example, $v$
must be at least double the maximum execution time in an execution unit,
which in total is less than $2u$ (from Proof Obligation 2 we know that the time
that any instruction spends in the execution unit is less than $u$). By doing this,
we are able to delay the instruction sufficiently far to avoid resource contention
when reordering.


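[Figure 5 diagram. Recoverable labels: restrict($\langle w, w_{ch} \rangle$),
replay($\langle w, w_{ch} \rangle$), node annotations $\sigma > 0$; panels (a)
and (b).]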


Fig. 5. (a) Illustration of Proof Obligation 3; the nodes are labelled with
their sizes. (b) Illustration of Proof Obligation 4. We must prove that self
loops return to the same state. (c) Illustration of Proof Obligation 5, the
ISA induction step.


Given an input-sequence pair $\langle w, w_{ch} \rangle$, define
$\mathit{restrict}_i(\langle w, w_{ch} \rangle)$ to be the projection of all
elements of $\langle w, w_{ch} \rangle$ not depending on i. Similarly, we
define $\mathit{replay}_i(\langle w, w_{ch} \rangle)$ to denote the projection
of the elements of $\langle w, w_{ch} \rangle$ that depend on i. The proof
obligation is then:

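Proof Obligation 3 (Incremental Step) For every reachable state $q_a$ such that
$\sigma(q_a) \geq 1$, for every input-sequence pair $\langle w, w_{ch} \rangle$
such that the first element of w is a non-bubble instruction and $w_{ch}$ is
$S_i(q_a, w)$-generated:

$$\delta_a(q_a, \langle w, w_{ch} \rangle) = \delta_a(q_a, \langle \hat{w}, \hat{w}_{ch} \rangle)$$

where $\langle \hat{w}, \hat{w}_{ch} \rangle$ is the concatenation of
$\mathit{restrict}_i(\langle w, w_{ch} \rangle)$ and
$\mathit{replay}_i(\langle w, w_{ch} \rangle)$.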

In  other  words,  we  must  show  that  the  stretched  sequence  results  in  the  same 
state  as  the  original  sequence.  The  proof  obligation  is  illustrated  in  Figure  5a. 
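The restrict/replay construction can be sketched on a toy model. Everything
below is an illustrative assumption rather than the paper's SAI: a step is a
register update `dest = src + 1`, the `deps` field records by hand which
instructions a step depends on, and bubbles and stalling choices are left out.
The check only tests one concrete sequence; it does not replace the symbolic
proof.

```python
from typing import NamedTuple


class Step(NamedTuple):
    instr: int        # instruction this step belongs to
    deps: frozenset   # instructions the step depends on (including itself)
    dest: str         # register written by the step
    src: str          # register read by the step


def restrict(steps, i):
    """Projection of the steps that do not depend on instruction i."""
    return [s for s in steps if i not in s.deps]


def replay(steps, i):
    """Projection of the steps that depend on instruction i."""
    return [s for s in steps if i in s.deps]


def stretch(steps, i):
    """Stretched sequence: independent steps first, then the replayed ones."""
    return restrict(steps, i) + replay(steps, i)


def run(state, steps):
    """Toy transition function: each step computes dest = src + 1."""
    state = dict(state)
    for s in steps:
        state[s.dest] = state[s.src] + 1
    return state


# Instruction 2 is the freshly issued one; 3 reads its result, 1 is unrelated.
steps = [
    Step(2, frozenset({2}), "r2", "r0"),
    Step(1, frozenset({1}), "r1", "r0"),
    Step(3, frozenset({2, 3}), "r3", "r2"),
]
initial = {"r0": 0}

original_state = run(initial, steps)
stretched_state = run(initial, stretch(steps, 2))
```

In this example both runs end in the same register contents, which is the
property Proof Obligation 3 asks for on the real model.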

As  we  shall  see  below,  in  the  proof  of  Proof  Obligation  3  it  is  sufficient  to 
consider  the  cases  where  the  necessary  resource  is  available  so  that  the  instruction 
being  moved  can  be  scheduled  appropriately  and  avoid  resource  contention.  This 
weakening  assumption  can  be  added  to  the  proof  obligation. 

Note that Proof Obligation 3 also requires internal registers holding
auxiliary values to agree in the resulting states. To illustrate, the replayed
instructions in our model may get their source operands from the RF rather
than the RB. The fields in the dispatch buffer indicating the physical sources
of the operands at issue may therefore differ and should be set to some reset
value after use.

Also observe that in each application of the induction step, more than one
instruction may retire within the v steps. Naturally, the worst-case upper
bound u (the number of cycles before an instruction is guaranteed to retire),
and therefore v, may be quite large in some designs due to execution units
with long latencies. This could result in symbolic expressions that are too
large to check. In these cases, the execution units and associated arbitration
logic must be abstracted separately.


The  final  proof  obligation  states  that  bubble  inputs  with  stalling  choices  do 
not  change  SAI  state  (illustrated  in  Figure  5b): 

Proof Obligation 4 (Correctness of Self-Loops) For every reachable state $q_a$,
instruction i, and stalling choice $ch_{st}$:

$$\delta_a(q_a, \langle i, ch_{st} \rangle) = q_a.$$

Taken together, these three proof obligations establish the Incremental
Flushing step of our verification, i.e., that every Max-n execution has a
functionally equivalent Max-1 execution. We next provide a brief sketch of the
proof.

Proof  Sketch: 

We assume the three Proof Obligations shown above and must show that for
every Max-n execution $\varepsilon_n$ there exists a corresponding Max-1
execution $\varepsilon_1$ such that

en(«:,(w,iUch))  =‘ 

We prove this by complete induction on the “distance” between the non-diagonal
Max-n execution and the Max-1 execution $\varepsilon_1$, where distance is the
number of “squares” and “triangles” that separate the two executions. For
example, eight squares and two triangles separate the executions in Figures 3a
and 3b.

First, if all states in $\varepsilon_n$ have $\sigma = 0$ in states where
instructions are issued, then we have a Max-1 sequence and are trivially done:
no more than one instruction is ever executed at a time. This is the base case.

Otherwise, we reduce the distance by inductively moving the last instruction
issued in a state with $\sigma \geq 1$ back until $\sigma = 0$. We repeat this
until no instructions overlap in execution and thus obtain the base case.

In the induction, we repeatedly choose the last such instruction i and identify
the choice subsequence of length v starting with i. If necessary, we can make
the subsequence long enough by extending it with extra trailing stalling
choices, using Proof Obligation 4. We then apply Proof Obligation 3. If we have
added the previously mentioned weakening assumption that resources are
available at the end of v, we can satisfy it by locating the last place that
the resource was freed and delaying the following rescheduling until after the
v cycles, using Proof Obligation 4.¹

We know that the number of internal steps between the instruction issue and
the end of the execution sequence monotonically decreases in each application,
since we are moving the instruction past at least one step of any kind in each
application. We also know that we are able to move all the internal steps of
the instruction, since the length v is greater than u. Furthermore, since the
instruction sequence is completed, we know that we are also moving the
instruction past instruction retirements, each time monotonically decreasing
the distance as defined above and eventually reaching the base case. The
induction is thus well-founded.

End Proof Sketch

¹ In implementations where the freeing and scheduling of the resource overlap
in time, we can prove a separate lemma that shows the correctness of the slight
delay of the rescheduling after the freeing.
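The termination argument can be sketched as a loop over a toy schedule. The
model is a deliberate simplification and an assumption of this sketch: an
execution is a list of (issue cycle, latency) pairs in issue order, $\sigma$ at
issue is the number of earlier instructions still in flight, and "distance" is
counted as the number of instructions issued with $\sigma \geq 1$ rather than
the squares and triangles of Figure 3. The function names are hypothetical.

```python
def sigma_at_issue(schedule, k):
    """Number of earlier instructions still in flight when k is issued."""
    issue_k = schedule[k][0]
    return sum(1 for issue, lat in schedule[:k] if issue + lat > issue_k)


def distance(schedule):
    """Simplified measure: instructions issued while sigma >= 1."""
    return sum(
        1 for k in range(len(schedule)) if sigma_at_issue(schedule, k) >= 1
    )


def flush_incrementally(schedule, v):
    """Delay the last overlapping instruction by v cycles until Max-1."""
    if any(lat > v for _, lat in schedule):
        raise ValueError("v must be at least the longest latency")
    schedule = list(schedule)
    distances = [distance(schedule)]
    while distances[-1] > 0:
        last = max(
            k for k in range(len(schedule))
            if sigma_at_issue(schedule, k) >= 1
        )
        # Stretching inserts v cycles, so later instructions shift as well.
        schedule = schedule[:last] + [
            (issue + v, lat) for issue, lat in schedule[last:]
        ]
        distances.append(distance(schedule))
    return schedule, distances


# Invented Max-3 schedule: three overlapping instructions, then a late one.
max_n = [(0, 3), (1, 2), (2, 4), (10, 1)]
max_1, distances = flush_incrementally(max_n, v=4)
```

For this schedule the measure drops by one in every application, mirroring the
well-foundedness argument above, and the final schedule has no two
instructions in flight at once.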

6.2  ISA  Step 

The final verification step is to show that all Max-1 executions of SAI are
functionally equivalent to ISA. Because the instruction sequence completes all
executions (i.e., leaves no outstanding instructions in the pipeline), we can
divide it up into issue-retire fragments in the Max-1 execution. We can assume
that each fragment has length u, since if one does not, we can apply Proof
Obligation 4 to add or remove the necessary stalling cycles. The proof is an
induction on the number of such fragments, comparing the execution and
retirement of an arbitrary instruction from an arbitrary Max-1 initial state
with the result that is retired by ISA. This is illustrated in Figure 5c.
Formally:

Proof Obligation 5 (SAI-ISA Induction) For every initial SAI state $q_a^0$,
instruction i, and input-sequence pair $\langle w, w_{ch} \rangle$ of length u
containing only i as its first instruction:


S^(g°,  (w,Wch))  =  ^e(^Rp(Qa)>0- 

Because we have previously shown that a functionally equivalent Max-1
execution can be derived from an arbitrary Max-n execution, this step
completes the proof of SAI-ISA equivalence.
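The fragment-by-fragment comparison can be sketched with a toy pipeline. The
model below is an assumption for illustration, not the paper's SAI or ISA: an
instruction is a triple (dest, src1, src2) that adds two registers, a fragment
is a list of choices in which any unrecognised entry acts as a stall, and the
stage names are hypothetical.

```python
def isa_step(rf, instr):
    """Toy ISA: execute one instruction atomically on the register file."""
    dest, a, b = instr
    out = dict(rf)
    out[dest] = rf[a] + rf[b]
    return out


def sai_fragment(rf, instr, choices):
    """Toy issue-retire fragment with a single instruction in flight."""
    dest, a, b = instr
    rf = dict(rf)
    stage, operands, result, rb = "empty", None, None, None
    for ch in choices:
        if ch == "issue" and stage == "empty":
            operands, stage = (rf[a], rf[b]), "issued"
        elif ch == "dispatch" and stage == "issued":
            result, stage = operands[0] + operands[1], "executed"
        elif ch == "writeback" and stage == "executed":
            rb, stage = result, "completed"      # result buffer (RB)
        elif ch == "retire" and stage == "completed":
            rf[dest], stage = rb, "empty"        # commit RB value to the RF
        # Any other choice is treated as a stall and changes nothing.
    if stage != "empty":
        raise ValueError("fragment did not retire its instruction")
    return rf


def equivalent_on(program, rf0, choices):
    """Check the RF after every fragment against the toy ISA."""
    rf_sai, rf_isa = dict(rf0), dict(rf0)
    for instr in program:
        rf_sai = sai_fragment(rf_sai, instr, choices)
        rf_isa = isa_step(rf_isa, instr)
        if rf_sai != rf_isa:
            return False
    return True


program = [("r2", "r0", "r1"), ("r3", "r2", "r2")]
rf0 = {"r0": 1, "r1": 2, "r2": 0, "r3": 0}
fragment = ["issue", "stall", "dispatch", "writeback", "stall", "retire"]
agrees = equivalent_on(program, rf0, fragment)
```

Because each fragment starts and ends with an empty pipeline, the comparison
only needs the register file, which is the role the issue-retire fragments play
in the induction above.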


7  Mechanical  Verification 

We have mechanically checked our simple SAI abstraction and Proof Obligations
3-5 for our example using the Stanford Validity Checker (SVC). The proofs
finished in minutes. The three models (IMPL, SAI, and ISA) and the proof
obligations are written in a Lisp-like HDL. The proof formulas are constructed
by symbolically simulating the models in Lisp. SVC is invoked through a
foreign-function interface to decide the validity of the formulas.

The mechanical verification of Proof Obligation 3 has exposed several bugs
in the way the choice signals were introduced in the SAI. For instance, the
original formulation did not stall retirement properly. This was detected in
the verification with SVC when a stretched execution retired an instruction
that the original execution did not. This illustrates the ability of the
incremental-flushing step to detect possible bugs in exposing the scheduler
interface.

We were able to locate the error using the counterexample information that
SVC produced when the error was reached. The counterexample is a conjunction
of predicates satisfied in the interpretation that falsifies the proof
obligation. The user can apply this information in the context of the original
system model to debug the error.


8  Discussion 


This work addresses a recurring difficulty encountered in symbolic verification
of out-of-order processor designs: the difficulty of creating an appropriate
implementation abstraction. The extension of the incremental flushing technique
enables significantly more automation than the basic technique alone and
reduces the need for manual abstraction. On the downside, the computational
complexity of the resulting proof obligations is higher, since more steps of
symbolic simulation are performed in each proof step. This was not an issue in
the verification of our very simple example. However, more research is needed
to address the application of the approach to more realistic designs. Also,
work is needed to establish whether our techniques for avoiding resource
contention during reordering are sufficient for more complex architectures.

It has been argued that localizing the (possibly distributed) scheduling logic
in a circuit will be difficult. Our assumption, based on practice, is that
optimal scheduling algorithms are determined empirically by simulation and that
their location and interfaces are clearly identifiable when plugging in
different scheduler implementations. We expect that this knowledge can be
exploited when locating the scheduling logic for verification purposes.